(57) Abstract

A method of reducing the perplexity of a speech recognition vocabulary and dynamically selecting speech recognition acoustic model 
sets used in a simulated telephone operator apparatus. The directory of users of the telephone network is subdivided into subsets wherein 
each subset contains the names of users within a certain location or exchange. A speech recognition vocabulary database is compiled for 
APPARATUS AND METHOD FOR REDUCING SPEECH RECOGNITION
VOCABULARY PERPLEXITY AND DYNAMICALLY SELECTING
ACOUSTIC MODELS

Field of the Invention 

This invention relates to automatic speech
recognition in telecommunication systems and to the use of
such systems to provide large scale voice activated dialing
and information retrieval services.

Background to the Invention

In the early development of telephone systems it
was commonplace for a telephone subscriber to converse
directly with a telephone operator at a telephone central
office. The telephone subscriber would verbally request the
telephone operator to complete a connection to a called
party. As telephone exchanges were small, the telephone
operator was aware of virtually all of the subscribers by
name and manually completed the requested connection. With
the advent of dial telephone services, calls within an
exchange could be completed automatically, and only certain
toll calls required operator assistance. Today, operator
assisted calls have become the exception and are usually
comparatively expensive. Machine-simulated operator
functions, including limited speech recognition services,
have recently become available for expediting some typical
operator-assisted functions. These include "collect" long
distance calls wherein completion of the connection is
contingent upon the called party agreeing to pay for the
service. However, these functions are limited to the simple
recognition of "yes" or "no" so there is little room for
non-functionality due to uncertainty as to which word was
spoken. There have also been advancements in voice
recognition services relating to directory assistance but
these too are directed to a very limited vocabulary.
Prior Art

The prior art contains several recent developments
pertaining to voice recognition in general, and to voice
recognition applicable to telecommunication systems in
particular.

U.S. Patent No. 5,091,947, which issued February
25, 1992 to Ariyoshi et al., entitled "Speech Recognition
Method and Apparatus", discloses a voice recognition system
for comparing both speaker dependent and speaker independent
utterances against stored voice patterns within a
coefficient memory. The voice identification comparator
selects the one voice pattern having the highest degree of
similarity with the utterance in question.

In U.S. Patent No. 5,165,095, which issued on
November 17, 1992, Borcherding discloses a voice recognition
system to initiate dialog to determine the correct telephone
number. According to the '095 patent, the calling party is
first identified so that a database containing speaker
templates can be accessed. These templates are then used to
compare the dial command so that the dialing instructions
can be recognized and executed. An example of a dialing
directive in the patent is "call home", with "call" being
the dial command and "home" being the destination
identifier.

Gupta et al., in U.S. Patent No. 5,390,278 issued
February 14, 1995, disclose a flexible vocabulary speech
recognition technique for recognizing speech transmitted via
the public switched telephone network. This voice recognition
technique is a phoneme based system wherein the phonemes are
modeled as hidden Markov models.

In spite of these ongoing developments, the
functionality of automatic recognition of human speech by
machine has not advanced to a degree wherein a calling party
can simply speak the called party's name and thereafter be
connected as reliably as a human operator in situations
where the database for a potential called party is very
large (greater than 100 names).

Summary of the Invention 

The present invention is in the field of human
speech recognition performed by machines and more
particularly relates to a reduction of the perplexity of the
speech recognition task in the context of names spoken by a
telephone user in a telephone system.

Individual users of telephone networks are divided
into subsets to facilitate identification of the vast number
of subscribers to the service. In the public network these
subsets are local exchanges. Private switching networks
such as the Nortel Electronic Switching Network (ESN)
assign individual ESN numbers to each location within the
private network. The present invention relies on these
subsets or location identifiers to reduce the perplexity of
a speech recognition application.

Therefore in accordance with a first aspect of the
present invention, there is provided a telephone network
including a plurality of telephone exchanges, each for
serving a plurality of telephone terminals and each being
interconnected with at least one other of the telephone
exchanges for providing telephone communications between
users of the telephone terminals. The network further
includes a simulated telephone operator apparatus for
receiving a speech request from a user for connection to
another telephone user and translating this request into a
directory number for use by the appropriate one of the
telephone exchanges. The translation is in accordance with
a speech recognition algorithm and an active speech
recognition vocabulary selected in accordance with the
origin of the request.

In an ESN application the active speech
recognition vocabulary is limited to the names of the
individuals serviced by the ESN number. In a preferred
embodiment the ESN number, which is also a location code, is
contained in the first two or three digits of the directory
number.

In accordance with a second aspect of the
invention there is provided a simulated telephone operator
server for a telephone network. The server has means for
storing voice utterances of a calling party telephone user
and means responsive to location information in association
with the telephone user for selecting an active speech
recognition vocabulary. Speech detection means are provided
for processing the stored voice utterances in accordance with
a speech recognition algorithm and the active speech
recognition vocabulary. Directory lookup means identify a
directory listing of a called party corresponding to a
result of the processing by the speech detection means. The
server also includes means for transmitting the directory
listing to a telephone exchange serving the called party.

In accordance with a further aspect of the
invention there is provided a telephone exchange comprising:
a plurality of ports for serving a plurality of telephone
users' telephone instruments via telephone lines; a trunk
facility for connection to another telephone exchange; a
switching network for connecting and disconnecting the
telephone instruments; a controller means for causing a
newly OFF HOOK telephone instrument to be coupled via the
switching network with a solicitation signal, and
subsequently for being responsive to a telephone number
received in association with the newly OFF HOOK telephone
instrument for completing a telephone call via the switching
network. The exchange also includes an originating register
means for storing voice band signals received from the newly
OFF HOOK telephone instrument via the switching network.
Means are provided for detecting digits represented by
frequency signals, within the stored voice band signals, in
accordance with a standard for key pad dialed telephone
numbers and for transmitting detected digits to the call
controller. A simulated telephone operator apparatus
receives and translates voice band signals in accordance
with a speech recognition algorithm and an active speech
recognition vocabulary selected in accordance with the
origin of the voice band signals into a directory number for
use by the controller means. An interface facility is
provided for transmitting the stored voice band signals via
the switching network to the simulated telephone operator
server apparatus in an event wherein the voice band signals
did not include a key pad dialed digit.

In accordance with yet a further aspect of the
present invention there is provided a method of detecting a
voiced speech request of a calling party for connection to
another user of an automatic telephone exchange. The method
comprises storing a plurality of speech recognition
vocabularies in association with geographic location of
users; receiving the voiced request and information as to
the geographic location of the user having voiced the
request from the automatic telephone exchange; selecting an
active speech recognition vocabulary in accordance with the
information as to the geographic location of the user and,
in accordance with a speech recognition algorithm and the
selected active speech recognition vocabulary, translating
the received request into a directory number for use by the
automatic telephone exchange in setting up a telephone
connection between the calling telephone user and the other
telephone user.

Brief Description of the Drawings

The invention will now be described in greater
detail with reference to the attached drawings wherein:

FIGURE 1 is a block diagram illustrating trunk
connections between private switch locations;

FIGURE 2 is a block diagram of the system hardware
architecture;

FIGURE 3 is an overall system state diagram; and

FIGURE 4 is a state diagram of the key word
handler.

Detailed Description of the Invention

The following description relates to an
enterprise-wide speech directory calling service within a
company or corporation having a number of locations. Each
location is assigned a unique electronic switching network
(ESN) location code or ESN number. As shown in the block
diagram of FIGURE 1, the on-site PBX 20 at each location is
connected to each other location via trunk connectors 22.
In this discussion the ESN comprises a three-digit code to
identify the location. It is to be understood, however,
that it is not essential to use all three digits to identify
the location as it may be sufficient to use the first two
for example.
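
The location-code convention can be illustrated with a short sketch. It assumes a directory number made of a leading ESN location code (three digits by default) followed by the extension digits; the helper name `split_directory_number` and the sample number are hypothetical and not taken from the specification.

```python
def split_directory_number(directory_number: str, code_digits: int = 3) -> tuple[str, str]:
    """Split a directory number into (location code, extension).

    Assumes the leading `code_digits` digits carry the ESN location code.
    Hypothetical helper for illustration only.
    """
    # Keep digits only, so separators such as "-" are ignored.
    digits = "".join(ch for ch in directory_number if ch.isdigit())
    if len(digits) <= code_digits:
        raise ValueError("directory number too short to contain a location code")
    return digits[:code_digits], digits[code_digits:]


location, extension = split_directory_number("393-1234")
print(location, extension)  # 393 1234
```

Passing `code_digits=2` models the case where the first two digits are sufficient to identify the location.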

FIGURE 2 illustrates the hardware architecture in
accordance with a preferred embodiment of the invention. As
shown, PBX 20 is connected to trunk 22 and to a plurality of
on site telephone sets as known in the art. The speech
recognition system 30 of the invention is connected to the
PBX 20 via T1 line 32 and T1 card 34, and via signal link 36
and signal link card 38. Speech recognition system 30
includes a speech recognition processor operating on a
speech recognition algorithm, central processor and control
units as well as memory cards for storing active speech
recognition vocabulary databases.

Although FIGURE 1 refers to a private switching
network using ESNs, it is to be understood that the
invention is not limited to such networks but can also be
adapted to use in public switching systems.

One objective metric used to measure the accuracy
of a speech recognition system is the Word Error Rate (WER).
The WER is defined as the total number of words incorrectly
recognized by a speech recognition system divided
by the total number of words spoken by a user of the system.

$$\mathrm{WER} = \frac{\text{Number of Errors Made by Recognizer}}{\text{Number of Words Spoken by User}}$$
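
As a small worked example of this ratio, the sketch below computes the WER from two counts. The function name and the trial figures are made up for illustration.

```python
def word_error_rate(errors: int, words_spoken: int) -> float:
    """WER = recognizer errors divided by words spoken by the user."""
    if words_spoken <= 0:
        raise ValueError("words_spoken must be positive")
    return errors / words_spoken


# Hypothetical trial: 12 misrecognized words out of 400 spoken.
print(word_error_rate(12, 400))  # 0.03
```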



The present invention makes use of information as
to the calling party's location for automatically assisting
in improving the WER of a speech recognition system on a
spoken called party's name for the purpose of connecting a
telephone call.

It has been empirically shown that the WER of a
speech recognition system will vary with the square root of
the perplexity of the vocabulary of words being recognized.
[Kimbal, O., et al., "Recognition Performance and
Grammatical Constraints", Proceedings of a Workshop on
Speech Recognition, Report Number SAIC-86/1546, Defense
Advanced Research Projects Agency, Palo Alto, February 19-20,
1986.]

$$\mathrm{WER} \propto \sqrt{\text{Perplexity}}$$



The perplexity of a vocabulary is defined as the
measure of the constraint imposed by a grammar, or the level
of uncertainty given the grammar of a population of users.
Perplexity is mathematically modeled and quantified in the
following way:

$$H = -\frac{1}{L} \sum_{w \in V} P(w) \log_2 P(w)$$

$$B = 2^{H}$$

where: H is entropy
P(w) is the probability of w being spoken
B is the perplexity of the application
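
The definition can be tried numerically with the sketch below. It assumes that the entropy is measured in bits and divided by the average number of words per vocabulary entry, which is the reading implied by the simplified formula given further on; the function name `perplexity` and the 1024-name directory are illustrative assumptions.

```python
import math


def perplexity(probabilities: list[float], words_per_sentence: float = 1.0) -> float:
    """Return B = 2**H for a probability distribution over vocabulary entries.

    H is the entropy in bits divided by the average number of words per
    entry (an assumed per-word normalization; pass 1.0 to disable it).
    """
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0.0)
    return 2.0 ** (entropy / words_per_sentence)


# 1024 equally likely two-word names (made-up directory size).
uniform = [1.0 / 1024] * 1024
print(perplexity(uniform, words_per_sentence=2.0))  # 32.0
```

With equally likely names the result equals the square root of the directory size, which matches the two-word simplification used in the text.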

The vocabulary of words in this implementation
consists entirely of proper names, location names, and a
small number of key words for command and control. For
large corporations with a large number of employees, the
proper names become the determining factor in measuring the
perplexity since the number of employees will overwhelm the
number of location names and key words. Thus location names
and key words can be ignored in this calculation. If we
make a simplifying assumption that every name is spoken with
equal probability, then the equation above can be simplified
to the following:



$$\text{Perplexity} = \sqrt[L]{S}$$

where: L is the average number of words in a sentence
S is the number of sentences in the vocabulary V

If we further make the simplification that proper
names contain two words (first and last name) and the
number of sentences in the vocabulary is equivalent to the
number of employee names, then we can further reduce the
equation to the following:

$$\text{Perplexity} = \sqrt{S}$$
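
The reduction can be written out step by step. The derivation below is a sketch that assumes the entropy is normalized by the average sentence length L (an assumption implied by the simplified formula rather than stated outright) and that each of the S names is spoken with probability 1/S.

```latex
\begin{aligned}
H &= -\frac{1}{L}\sum_{w \in V} P(w)\,\log_2 P(w)
   = -\frac{1}{L}\sum_{w \in V} \frac{1}{S}\,\log_2 \frac{1}{S}
   = \frac{1}{L}\,\log_2 S, \\
\text{Perplexity} &= 2^{H} = 2^{\frac{1}{L}\log_2 S} = S^{1/L} = \sqrt[L]{S}, \\
L = 2 &\;\Longrightarrow\; \text{Perplexity} = \sqrt{S}.
\end{aligned}
```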

If we make the assumption that the amount of
confusability between names within a large database will be
similar between large databases, the accuracy of a speech
recognition system has the following relationship with the
number of names in the vocabulary:

$$\mathrm{WER} \propto \sqrt[4]{\text{Number of Active Directory Names}}$$

We can observe from the above equations that the
WER increases with the perplexity and thus increases with
the number of proper names in the vocabulary.
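
To give a feel for the size of the effect, the sketch below compares the expected WER for two vocabulary sizes. It assumes that WER varies with the square root of perplexity and that perplexity is the square root of the number of two-word names, so that WER scales with the fourth root of the number of active names; the function name and the sizes are hypothetical.

```python
def relative_wer(active_names: int, reference_names: int) -> float:
    """Ratio WER(active) / WER(reference) under WER ~ N ** 0.25.

    The fourth root combines WER ~ sqrt(perplexity) with
    perplexity = sqrt(N) for two-word names (assumptions, see lead-in).
    """
    if active_names <= 0 or reference_names <= 0:
        raise ValueError("name counts must be positive")
    return (active_names / reference_names) ** 0.25


# Made-up sizes: a 625-name site vocabulary versus a 10,000-name directory.
print(relative_wer(625, 10_000))
```

Under these assumptions a sixteen-fold reduction in active names halves the expected WER.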

In the past, speech recognition scientists have
used various methods to reduce the perplexity in an effort
to improve the WER of a speech recognition system. To
achieve this result, most of these efforts have been focused
on the linguistic level. For example, scientists have used
statistical language models and linguistic rules of
phonology to reduce perplexity or uncertainty in recognizing
a spoken word or phrase.

In this implementation the list of employee names
for each location is stored in a separate speech recognition
vocabulary. The employee name will normally be associated
with the four digits of the telephone number following the
three-digit ESN or location code. According to the system
of the present invention a calling party wishing to speak to
another employee at the same location will simply announce
the first and last name of the employee to whom a connection
is desired. The speech recognition system will assume that
calling party and called party are at the same location and
load the active speech recognition vocabulary database
containing the names of all employees at that location.
Using a conventional speech recognition algorithm the name
spoken by the calling party is compared by the system
against the names of all employees in the database and the
closest match is selected. The name selected is announced
to the calling party and the call is automatically connected
to the line associated with the telephone number assigned to
the called party unless the calling party interrupts the
process by saying, "No." Thereafter the voice recognition
system releases from the call.
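
The same-location lookup can be sketched as follows. This is an illustration only: the vocabularies, names and numbers are invented, and text similarity from Python's `difflib` stands in for the acoustic matching that a real speech recognition algorithm would perform.

```python
import difflib

# Hypothetical per-location vocabularies keyed by ESN location code;
# each maps an employee name to a four-digit extension.
VOCABULARIES = {
    "393": {"Alice Martin": "1001", "Robert Chen": "1002"},
    "444": {"Alice Martens": "2001", "Maria Lopez": "2002"},
}


def recognize_name(spoken: str, location_code: str) -> tuple[str, str]:
    """Return (name, directory number) of the closest match at one location.

    Only the vocabulary for `location_code` is searched, so names held at
    other sites cannot be confused with the request.
    """
    vocabulary = VOCABULARIES[location_code]
    # cutoff=0.0 always yields the best-scoring entry; a real recognizer
    # would also be able to reject poor matches.
    best = difflib.get_close_matches(spoken, list(vocabulary), n=1, cutoff=0.0)[0]
    return best, location_code + vocabulary[best]


print(recognize_name("Alice Martin", "393"))  # ('Alice Martin', '3931001')
```

Because the active vocabulary is chosen before matching, the similar name held at location 444 never competes with the request made from location 393.
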

If the called party is at a different location
than the calling party, the calling party will first
announce the location of the called party followed by the
called party's name. The voice recognition system responds
by announcing the location and subsequently loading the
active voice recognition vocabulary database including the
names of all the employees at the announced location of the
called party. The voice recognition system then selects the
name in the loaded database that most closely matches the
name of the called party. The selected name is announced to
the calling party and the call is automatically connected to
the line associated with the telephone number assigned to
the called party unless the calling party interrupts the
process by saying, "No." Thereafter the voice recognition
system releases from the call.

Because the active voice recognition vocabulary
set associated with each ESN or location contains only a
portion of the total number of employees of the corporation
or company, the WER is much lower than it would be if the
complete employee directory were loaded in the database.

The actual decrease in the corporate-wide WER
(C_WER) is contingent upon how evenly the employees are
spread over the different sites. In the best case, where the
employees are evenly distributed across the sites, C_WER will
decrease according to the following relation.

$$\mathrm{C\_WER} = \frac{\mathrm{WER}}{\sqrt[4]{n}}$$

where: n is the number of sites.

In the worst case, where there is only one
employee in each site, except for one site which holds all
of the remaining employees, there will be a negligible
decrease in the C_WER.

$$\mathrm{C\_WER} \propto \sqrt[4]{m - n}$$

where: m is the number of employees in the company.

$$\mathrm{C\_WER} \approx \mathrm{WER} \quad \text{for } m \gg n$$
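
The best and worst cases can be compared numerically with the sketch below. It rests on two assumptions that are not spelled out in the text: the WER at a site scales with the fourth root of the number of names at that site, and calls are spread across sites in proportion to site size. The function name and the employee counts are hypothetical.

```python
def relative_c_wer(site_sizes: list[int]) -> float:
    """Corporate-wide WER relative to one full-directory vocabulary.

    Assumptions: per-site WER ~ (names at site) ** 0.25, and each site
    contributes in proportion to its share of the employees.
    """
    total = sum(site_sizes)
    weighted = sum((size / total) * size ** 0.25 for size in site_sizes)
    return weighted / total ** 0.25


best = relative_c_wer([625] * 16)           # 10,000 employees, even split
worst = relative_c_wer([1] * 15 + [9_985])  # same total, one large site
print(round(best, 3), round(worst, 3))
```

With sixteen equal sites the ratio is 16 ** -0.25, that is one half, while the lopsided split leaves the ratio just below one, in line with the negligible decrease described for the worst case.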

In the same way that ESN information is used by
the speech recognition system to dynamically load the active
vocabulary set, the ESN information can also be used by the
speech recognition system to select the appropriate acoustic
model set. Speech recognition systems use previously
collected speech samples to serve as reference templates
against which new spoken speech samples are matched for
classification. Statistical pattern recognition techniques
are used to match new speech samples against reference
templates to determine the closest match. These reference
templates are referred to as acoustic models in the speech
recognition system. Acoustic models may vary according to
the regional accent and consequently according to ESN
locations. The speech recognition system can use
site-specific acoustic models that are dynamically loaded based
on the ESN information presented at the time of the call.
Having site-specific acoustic models will also decrease the
WER of the system.
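
Selecting an acoustic model set from the ESN information amounts to a keyed lookup. The sketch below is illustrative only: the location codes, model set names and the generic fallback are assumptions, not details from the specification.

```python
# Hypothetical acoustic model sets keyed by ESN location code.
ACOUSTIC_MODELS = {
    "393": "models/site_a",
    "444": "models/site_b",
}
DEFAULT_MODEL = "models/generic"  # assumed fallback for unlisted sites


def select_acoustic_model(location_code: str) -> str:
    """Pick the site-specific model set, falling back to a generic one."""
    return ACOUSTIC_MODELS.get(location_code, DEFAULT_MODEL)


print(select_acoustic_model("444"))  # models/site_b
print(select_acoustic_model("999"))  # models/generic
```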

The following specification illustrates an example
of the NORTEL Speech Directory Calling Service. The state
diagrams shown in FIGURES 3 and 4 describe the user
interface that users of the service experience and are not
meant as an implementation specification. Some parts of the
system, such as error recovery and instructions, have been
omitted.

In the description that follows, the use of
italics denotes system state and the use of a dollar sign
symbol denotes a parameter.
Description of the States in Alphabetical Order:

Cancel:
    Play Who
    go to Listening Timeout

Idle:
    /* Go to Idle anytime a user hangs up */
    On an incoming call
        Get ESN information
        Set $Location based on ESN information
        go to Listening Timeout

Keyword Handler:
    Case
        Service Locations:  go to Service Location
        Receptionist:       go to Transfer Receptionist
        Cancel:             go to Cancel
    End Case



Known Loc:
    Set $Location to $RecognizedWord
    Play $Location
    Play EmployeeName
    go to Listening Timeout

Listening Timeout:
    Listen for $Timeout
    If the user speaks
        go to Speech
    Else
        go to Prompt

Loc Handler:
    If $Location is known location
        go to Known Loc
    Else
        go to Unknown Loc



Prompt:
    Case (state before Listening Timeout)
        Idle:
            Play Who
            go to Listening Timeout
        The rest of the states:
            When $Timeout expires on the first two times
                Play TimeoutHelp.$Location
                $Timeout = 4 sec
                go to Listening Timeout
            When $Timeout expires on the third time
                Play Difficulties
                go to Transfer Receptionist
    End Case



Service Location:
    Play ServiceAvailable
    Play $Location list
    Play Who
    go to Listening Timeout

Speech:
    Load the active vocabulary set from $Location
    Get $RecognizedWord from Speech Recognizer
    Case ($RecognizedWord)
        Rejection:  go to Rejection Handler
        $Name:      go to Transfer Call
        $Location:  go to Loc Handler
        Key Word:   go to Keyword Handler
    End Case

Transfer Call:
    Database Lookup for Employee Phone Number
    Transfer the call
    go to Idle

Transfer Receptionist:
    Play TransferReceptionist
    Transfer the call to the receptionist
    go to Idle

Unknown Loc:
    Play NotServiced.$Location
    go to Listening Timeout
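
The state descriptions can be exercised with a small simulation. The sketch below is not an implementation of the service: the recognizer is replaced by a scripted list of results, the location names are invented, the timeout count is kept for the whole call, and a rejection is treated like Cancel because the Rejection Handler is not described in the text.

```python
KNOWN_LOCATIONS = {"Site A", "Site B"}  # hypothetical serviced locations


def run_call(origin: str, utterances: list) -> list[str]:
    """Walk the states for one call and return the actions taken.

    `utterances` scripts the recognizer output: each item is None (silence
    until $Timeout expires) or a (kind, word) pair, with kind one of
    "name", "location", "keyword" or "rejection".
    """
    log: list[str] = []
    location = origin      # Idle: $Location comes from ESN information
    previous = "Idle"      # state entered before Listening Timeout
    timeouts = 0
    heard = None
    pending = iter(utterances)
    state = "Listening Timeout"
    while True:
        if state == "Listening Timeout":
            heard = next(pending, None)  # an exhausted script counts as silence
            state = "Speech" if heard is not None else "Prompt"
        elif state == "Prompt":
            if previous == "Idle":
                log.append("Play Who")
            else:
                timeouts += 1
                if timeouts == 3:
                    log += ["Play Difficulties", "Transfer to receptionist"]
                    return log
                log.append(f"Play TimeoutHelp.{location}")
            previous, state = "Prompt", "Listening Timeout"
        else:  # Speech: only the vocabulary for `location` would be active
            kind, word = heard
            previous, state = "Speech", "Listening Timeout"
            if kind == "name":  # Transfer Call
                log.append(f"Transfer call to {word} at {location}")
                return log
            if kind == "location" and word in KNOWN_LOCATIONS:  # Known Loc
                location = word
                log += [f"Play {location}", "Play EmployeeName"]
            elif kind == "location":  # Unknown Loc
                log.append(f"Play NotServiced.{word}")
            elif (kind, word) == ("keyword", "Receptionist"):
                log += ["Play TransferReceptionist", "Transfer to receptionist"]
                return log
            elif (kind, word) == ("keyword", "Service Locations"):
                log += ["Play ServiceAvailable", "Play $Location list", "Play Who"]
            else:  # Cancel keyword, or a rejection (assumed to re-prompt)
                log.append("Play Who")


script = [None, ("location", "Site B"), ("name", "Maria Lopez")]
for action in run_call("Site A", script):
    print(action)
```

In this scripted call the user stays silent once, names another location, then names the employee, so the trace shows the Who prompt, the location announcement, the EmployeeName prompt and the transfer.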

Index of the Prerecorded Prompts in Alphabetical Order:

Calling:
    Calling $Name?

Difficulties:
    The system is having difficulties with your request.
    Transferring to a receptionist.

EmployeeName:
    Employee name?

NotServiced:
    This service is not available in $Location. Choose
    another location or for a list of serviced ESN locations,
    say "Service Locations".

ServiceAvailable:
    This service is available for the following Nortel/BNR
    locations: $Location list.

TransferReceptionist:
    Transferring to a receptionist.

Who:
    Who would you like to call?

A specific embodiment of the invention has been
disclosed and illustrated. It will be apparent to one
skilled in the art that various changes in methodology
and/or approach can be made without departing from the
spirit and scope of this invention as defined in the
appended claims.
I CLAIM:

1. A telephone network including:

a plurality of telephone exchanges each for
serving a plurality of telephone instruments and each being
interconnected with at least one other of the telephone
exchanges, for providing telephone communications between
telephone users associated with the telephone instruments;
and

a simulated telephone operator apparatus for
receiving a voiced speech request from a user for connection
to another of the telephone users and translating said
request into a directory number for use by one of the
telephone exchanges in accordance with a speech recognition
algorithm and an active speech recognition vocabulary
selected in accordance with the origin of the request.

2. A simulated telephone operator server for a
telephone network comprising:

means for storing voice utterances of a calling
party telephone user;

means responsive to location information in
association with the telephone user for selecting an active
speech recognition vocabulary;

speech detection means for processing the stored
voice utterances in accordance with a speech recognition
algorithm and said active speech recognition vocabulary;

directory lookup means for identifying a directory
listing of a called party corresponding to a result of said
processing by the speech detection means; and

means for transmitting the directory listing to a
telephone exchange serving said called party.

3. A simulated telephone operator server as defined
in claim 2, wherein the directory lookup means defaults to
identification by a telephone attendant directory listing in
the event of there being no called party directory listing
corresponding to the result of said processing by the speech
detection means.

WO 97/37481 PCT/CA97/00008



4. A telephone exchange comprising:

a plurality of ports for serving a plurality of
telephone users' telephone instruments via telephone lines;

a trunk facility for connection to another
telephone exchange;

a switching network for connecting and
disconnecting the telephone instruments;

a controller means for causing a newly OFF HOOK
telephone instrument to be coupled via the switching network
with a solicitation signal, and subsequently for being
responsive to a telephone number received in association
with the newly OFF HOOK telephone instrument for completing
a telephone call via the switching network;

an originating register means for storing voice
band signals received from the newly OFF HOOK telephone
instrument via the switching network;

means for detecting digits represented by
frequency signals, within the stored voice band signals, in
accordance with a standard for key pad dialed telephone
numbers, and for transmitting detected digits to the call
controller;

a simulated telephone operator apparatus for
receiving and translating voice band signals in accordance
with a speech recognition algorithm and an active speech
recognition vocabulary selected in accordance with the
origin of the voice band signals into a directory number for
use by the controller means; and

an interface facility for transmitting the stored
voice band signals via the switching network to the
simulated telephone operator server apparatus in an event
wherein the voice band signals did not include a key pad
dialed digit.

5. A telephone exchange as defined in claim 4,
wherein the call controller means is operative to cause the
interface means to transmit said stored voice band signals
via the switching network to the simulated telephone
operator server apparatus in an event wherein the voice band
signals included a key pad dialed digit designating the
simulated telephone operator apparatus.



6. A simulated telephone operator apparatus for
receiving a user voiced speech request for connection to
another user of a telephone network and translating said
request into a directory number for use by an automatic
telephone exchange, in accordance with a speech recognition
algorithm and an active speech recognition vocabulary
selected in accordance with the origin of the request.

7. A method for detecting a calling telephone user
voiced speech request for connection to another telephone
user via an automatic telephone exchange comprising:

storing a plurality of speech recognition
vocabularies in association with geographic locations of
users;

receiving the voiced speech request and
information as to the geographic location of the user having
voiced the speech request from the automatic telephone
exchange;

selecting an active speech recognition vocabulary
in accordance with the information as to the geographic
location of the user; and

in accordance with a speech recognition algorithm
and the selected active speech recognition vocabulary,
translating the received request into a directory number for
use by the automatic telephone exchange in setting up a
telephone connection between the calling telephone user and
said another telephone user.